Medical Insurance Prediction¶

Alper ŞAHİN¶

https://www.kaggle.com/code/alperahin111/insurance-prediction/notebook

**Libraries:**

In [1]:
import pandas as pd           
import numpy as np              
from scipy import stats           
import statsmodels.api as sm    

import seaborn as sns           
import matplotlib.pyplot as plt  
import plotly.express as px   

**Multiple Regression Problem:**

**Exploratory Data Analysis:**

In [2]:
dataset = pd.read_csv('expenses.csv')
dataset
Out[2]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
... ... ... ... ... ... ... ...
1333 50 male 30.970 3 no northwest 10600.54830
1334 18 female 31.920 0 no northeast 2205.98080
1335 18 female 36.850 0 no southeast 1629.83350
1336 21 female 25.800 0 no southwest 2007.94500
1337 61 female 29.070 0 yes northwest 29141.36030

1338 rows × 7 columns

In [3]:
shape_first = dataset.shape[0]
dataset.drop_duplicates(inplace = True)
print( shape_first - dataset.shape[0], ' --> removed from raw data.' )
dataset
1  --> removed from raw data.
Out[3]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
... ... ... ... ... ... ... ...
1333 50 male 30.970 3 no northwest 10600.54830
1334 18 female 31.920 0 no northeast 2205.98080
1335 18 female 36.850 0 no southeast 1629.83350
1336 21 female 25.800 0 no southwest 2007.94500
1337 61 female 29.070 0 yes northwest 29141.36030

1337 rows × 7 columns

In [4]:
b = sns.countplot(data=dataset, x='region')

b.axes.set_title("Distribution of Region",fontsize=20, fontweight = 'bold')
b.set_xlabel("Regions",fontsize=14,fontweight = 'bold')
b.set_ylabel("Sample Sizes",fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
plt.show()
In [5]:
b = sns.countplot(data=dataset, x='sex')

b.axes.set_title("Distribution of Gender",fontsize=20, fontweight = 'bold')
b.set_xlabel("Gender",fontsize=14,fontweight = 'bold')
b.set_ylabel("Sample Sizes",fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
plt.show()
In [6]:
b = sns.countplot(data=dataset, x='smoker')

b.axes.set_title("Distribution of Smoking Habit",fontsize=20, fontweight = 'bold')
b.set_xlabel("Smoking Habit",fontsize=14,fontweight = 'bold')
b.set_ylabel("Sample Sizes",fontsize=14, fontweight = 'bold')
b.tick_params(labelsize=14)
plt.show()
In [7]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1337 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB
In [8]:
[ print(str(col+ ':'), dataset[col].unique()) for col in ['sex', 'smoker', 'region'] ]
sex: ['female' 'male']
smoker: ['yes' 'no']
region: ['southwest' 'southeast' 'northwest' 'northeast']
Out[8]:
[None, None, None]

**Dummy Variables for categorical Values:**

In [9]:
# Sex
dataset['Sex_int'] = dataset['sex'].apply(lambda x: 0 if 'female' == x   else 1)

'''
Female = 0

Male   = 1

'''

# Smoking Habit
dataset['smoker_int'] = dataset['smoker'].apply(lambda x: 0 if 'no' == x   else 1)

'''
No  = 0

Yes = 1

'''

# Bölge
dataset_new = pd.get_dummies(dataset, columns = ['region'] )

dataset_new
Out[9]:
age sex bmi children smoker charges Sex_int smoker_int region_northeast region_northwest region_southeast region_southwest
0 19 female 27.900 0 yes 16884.92400 0 1 0 0 0 1
1 18 male 33.770 1 no 1725.55230 1 0 0 0 1 0
2 28 male 33.000 3 no 4449.46200 1 0 0 0 1 0
3 33 male 22.705 0 no 21984.47061 1 0 0 1 0 0
4 32 male 28.880 0 no 3866.85520 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
1333 50 male 30.970 3 no 10600.54830 1 0 0 1 0 0
1334 18 female 31.920 0 no 2205.98080 0 0 1 0 0 0
1335 18 female 36.850 0 no 1629.83350 0 0 0 0 1 0
1336 21 female 25.800 0 no 2007.94500 0 0 0 0 0 1
1337 61 female 29.070 0 yes 29141.36030 0 1 0 1 0 0

1337 rows × 12 columns

In [10]:
'''New columns belonging to 4 different regions were created. If any three of these columns are zero, the remaining column 
   can be also known; consequently it was removed from the dataset. Otherwise, multicollinearity problem arises and the 
   parameters cannot be predicted.

Example : region_northeast can be determined according to related columns.

          region_northwest  region_southeast  region_southwest
                  0                 0                  0
      
'''
dataset_new.drop(['region_northeast', 'sex', 'smoker'], axis=1, inplace = True)
dataset_new
Out[10]:
age bmi children charges Sex_int smoker_int region_northwest region_southeast region_southwest
0 19 27.900 0 16884.92400 0 1 0 0 1
1 18 33.770 1 1725.55230 1 0 0 1 0
2 28 33.000 3 4449.46200 1 0 0 1 0
3 33 22.705 0 21984.47061 1 0 1 0 0
4 32 28.880 0 3866.85520 1 0 1 0 0
... ... ... ... ... ... ... ... ... ...
1333 50 30.970 3 10600.54830 1 0 1 0 0
1334 18 31.920 0 2205.98080 0 0 0 0 0
1335 18 36.850 0 1629.83350 0 0 0 1 0
1336 21 25.800 0 2007.94500 0 0 0 0 1
1337 61 29.070 0 29141.36030 0 1 1 0 0

1337 rows × 9 columns

In [11]:
def sns_graph(df, parameter_x, parameter_y, hue):
    b = sns.scatterplot(data = df, x = parameter_x, y = parameter_y, hue = hue)
    plt.title( str(parameter_x) + ' - ' + str(parameter_y), fontweight = 'bold', fontsize=15 )
    b.set_xlabel(str(parameter_x),fontsize=14,fontweight = 'bold')
    b.set_ylabel(str(parameter_y),fontsize=14, fontweight = 'bold')
    b.tick_params(labelsize=14)
In [12]:
sns_graph(dataset, 'age', 'charges', None)

# When the Age Insurance Payment graph is drawn, it is clearly seen that the insurance payment is concentrated in three different regions.
# It can be interpreted from the graph that insurance payments increase as you get older.
In [13]:
sns_graph(dataset, 'age', 'charges', 'region')
# It is seen that there is not much difference in insurance payments according to regions.
In [14]:
sns_graph(dataset, 'age', 'charges', 'smoker')

# It is observed that the general population of non-smokers makes insurance payments of 15,000 or less.
In [15]:
fig = px.scatter_3d(dataset, x="age", y= 'bmi', z="charges",color = 'smoker',
                 color_continuous_scale=px.colors.sequential.RdBu, title=" Age- Charges- BMI")

fig.update_layout(showlegend=False, paper_bgcolor = 'rgba(0, 0, 0, 0)', plot_bgcolor = 'rgba(0, 0, 0, 0)' )
fig.update_xaxes(linewidth = 2, linecolor ='black')
fig.update_yaxes(linewidth = 2, linecolor ='black')

fig.show()
In [16]:
dataset.groupby(by = ['region', 'smoker','sex']).count()
Out[16]:
age bmi children charges Sex_int smoker_int
region smoker sex
northeast no female 132 132 132 132 132 132
male 125 125 125 125 125 125
yes female 29 29 29 29 29 29
male 38 38 38 38 38 38
northwest no female 135 135 135 135 135 135
male 131 131 131 131 131 131
yes female 29 29 29 29 29 29
male 29 29 29 29 29 29
southeast no female 139 139 139 139 139 139
male 134 134 134 134 134 134
yes female 36 36 36 36 36 36
male 55 55 55 55 55 55
southwest no female 141 141 141 141 141 141
male 126 126 126 126 126 126
yes female 21 21 21 21 21 21
male 37 37 37 37 37 37

**Outlier Detection & Remove Outliers**

In [17]:
def box_plot (df, parameter):
    plt.rcParams["figure.figsize"] = [5.50, 5.50]
    sns.boxplot(data = pd.DataFrame(df[parameter]))
    plt.title(str(parameter)+ ' Box Plot', fontsize=18, fontweight ='bold')
    plt.tick_params(axis='both', which='major', labelsize=14)
    plt.show()
In [18]:
[box_plot(dataset_new, parameter) for parameter in ['age', 'bmi', 'children','charges']]
Out[18]:
[None, None, None, None]
In [19]:
def outlier_detection(df, column_name):
    df_out = df.copy()
    Q1 = np.percentile(df[str(column_name)], 25,
                       interpolation = 'midpoint')

    Q3 = np.percentile(df[str(column_name)], 75,
                       interpolation = 'midpoint')
    IQR = Q3 - Q1

    # Upper bound
    upper = np.where(df[str(column_name)] >= (Q3+1.5*IQR))
    # Lower bound
    lower = np.where(df[str(column_name)] <= (Q1-1.5*IQR))

    ''' Removing the Outliers '''
    df_out.drop(upper[0], inplace = True)
    df_out.drop(lower[0], inplace = True)

    print(df.shape[0] - df_out .shape[0]  , " rows are removed from data")
    
    return df_out
In [20]:
dataset_wo_outliers = outlier_detection(dataset_new,'bmi')
9  rows are removed from data
In [21]:
sns_graph(dataset_wo_outliers, 'age', 'charges', None)
In [22]:
dataset_wo_outliers = dataset_wo_outliers[dataset_wo_outliers['charges'] < 50000]
dataset_wo_outliers.reset_index(drop=True,inplace=True)
In [23]:
sns_graph(dataset_wo_outliers, 'age', 'charges', None)

**Correlation of Independent Variables:**

In [24]:
'''

In order to estimate the health insurance payment, by looking at the bilateral relations of the observed variables with each other, the variables
dependencies can be checked.


With values close to -1, there is a strong inverse ratio between the two variables.
With values close to 1,  there is a strong direct ratio between the two variables.
With values close to 0,  there is no correlation.

'''

independent_variables = ['age','bmi',
                         'children','Sex_int',
                         'smoker_int','region_northwest',
                         'region_southeast','region_southwest']

dependent_variable = ['charges']
In [25]:
plt.rcParams["figure.figsize"] = [15.50, 12.50]
plt.rcParams["figure.autolayout"] = True

corr= dataset_wo_outliers[independent_variables].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr,cmap="RdYlGn", annot = True, annot_kws={"size": 14, "va": "center_baseline", "color": "black"},mask = mask,linewidth=3, vmin=-1, vmax=1)
plt.title('Correlation Matrix', fontweight = 'bold', fontsize = 16)
Out[25]:
Text(0.5, 1.0, 'Correlation Matrix')

When we look at the correlation map, it can be said that there is no linear correlation between the independent variables.

**Conformity to Normal Distribution**

https://ravenfo.com/2021/07/11/normal-dagilim-python-normallik-testi/

**Shapiro-Wilk Test**

  • It is used to test whether a data set is only normally distributed.
  • The test result shows the W stat value. The calculated W statistic takes a value between 0 and 1.
  • the W value close to 1, the more suitable the data is for a normal distribution.

! It is not recommended to be used in cases where the sample size is larger than 5,000, as it may create bias.

In [26]:
def Shapiro_Wilk_Test(df, parameter):
    
    print(str(parameter) + " test whether a data set is only normally distributed" )
    w, p = stats.shapiro(df[parameter])

    alpha = 0.05

    print('W Statictic:', w, '\np value:',p) 

    if p > alpha:
        print('The H0 hypothesis cannot be rejected. There is not enough evidence to say that the data is not normally distributed.\n')

    else:
        print('H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.\n')
In [27]:
[Shapiro_Wilk_Test(dataset_wo_outliers, parameter) for parameter in ['age', 'bmi']]
age test whether a data set is only normally distributed
W Statictic: 0.9447934627532959 
p value: 8.067505035029967e-22
H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.

bmi test whether a data set is only normally distributed
W Statictic: 0.9945359230041504 
p value: 9.369832696393132e-05
H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.

Out[27]:
[None, None]

Although the W statistic are close to 1, the value is not statistically significant since the p-value is smaller than the alpha value at the 95% confidence level..

**Kolmogorov-Smirnov Test**

  • Compares the distribution of the sample with the standard normal distribution with a mean of 0 and a standard deviation of 1.
  • The variable must be standardized.
In [28]:
def Kolmogorov_Smirnov_Test (df, parameter):
    
    print(str(parameter) + " test whether a data set is only normally distributed." )
    
    ks, p = stats.kstest(df[parameter], "norm")

    alpha = 0.05

    print('ks Statistic:', ks, '\np value:',p)  

    if p > alpha:
        print('The H0 hypothesis cannot be rejected. There is not enough evidence to say that the data is not normally distributed.\n')

    else:
        print('H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.\n')

Standardization:

**$z$ = $ x - \sigma \over \mu $**

x : Value to be converted

$\sigma$ : The average of the variable to be converted

$\mu$ : The standard deviation of the variable to be converted

In [29]:
def standardization(df, parameter):
    df[parameter] =  ( df[parameter]- df[parameter].mean() ) / df[parameter].std()
In [30]:
dataset_wo_outliers_stand = dataset_wo_outliers.copy()
[standardization(dataset_wo_outliers_stand, parameter) for parameter in ['age', 'bmi', 'children']]
dataset_wo_outliers_stand
Out[30]:
age bmi children charges Sex_int smoker_int region_northwest region_southeast region_southwest
0 -1.434282 -0.447437 -0.912024 16884.92400 0 1 0 0 1
1 -1.505455 0.524076 -0.083936 1725.55230 1 0 0 1 0
2 -0.793725 0.396637 1.572239 4449.46200 1 0 0 1 0
3 -0.437860 -1.307234 -0.912024 21984.47061 1 0 1 0 0
4 -0.509033 -0.285242 -0.912024 3866.85520 1 0 1 0 0
... ... ... ... ... ... ... ... ... ...
1317 0.772082 0.060663 1.572239 10600.54830 1 0 1 0 0
1318 -1.505455 0.217892 -0.912024 2205.98080 0 0 0 0 0
1319 -1.505455 1.033831 -0.912024 1629.83350 0 0 0 1 0
1320 -1.291936 -0.794997 -0.912024 2007.94500 0 0 0 0 1
1321 1.554986 -0.253796 -0.912024 29141.36030 0 1 1 0 0

1322 rows × 9 columns

In [31]:
[Kolmogorov_Smirnov_Test(dataset_wo_outliers_stand, parameter) for parameter in ['age', 'bmi']]
age test whether a data set is only normally distributed.
ks Statistic: 0.07906046056549959 
p value: 1.2358016493060771e-07
H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.

bmi test whether a data set is only normally distributed.
ks Statistic: 0.02605340424804603 
p value: 0.32522439680069026
The H0 hypothesis cannot be rejected. There is not enough evidence to say that the data is not normally distributed.

Out[31]:
[None, None]

There is insufficient evidence at the 95% confidence interval to state that the Body Mass Index is not normally distributed according to the Kolmogorov-Smirnov test.

In [32]:
def dist_plot(df, parameter):
    
    sns.displot(df[parameter])
    plt.xlabel(str(parameter), fontsize=14);
    plt.ylabel('Count', fontsize=14);
    plt.title(str(parameter)+ ' Distribution', fontsize=18, fontweight ='bold')
    plt.tick_params(axis='both', which='major', labelsize=14)
    plt.show()
    print('Skewness:', stats.skew(df[parameter]))
  • If the skewness is between -0.5 and +0.5, the distribution is approximately symmetrical..
In [33]:
[dist_plot(dataset_wo_outliers, parameter) for parameter in ['age', 'bmi', 'children','charges']]
Skewness: 0.06360208845683453
Skewness: 0.25427518233179686
Skewness: 0.93006284892155
Skewness: 1.44292532828262
Out[33]:
[None, None, None, None]
In [34]:
def QQ_plot(df, parameter): 
    plt.rcParams["figure.figsize"] = [4.50, 4.50]
    plt.rcParams["figure.autolayout"] = True
    stats.probplot(df[str(parameter)], dist = "norm", plot = plt)
    plt.title(str(parameter)+ " Q-Q Graph", fontsize=18, fontweight ='bold')
    plt.show()

According to the distribution tables given above:

  • Age distribution is uniform.
  • The distributions of the number of children and the Insurance payment are positively skewed.
  • Body Mass Index appears on the graph to have a normal distribution.

Variables other than Body Mass Index must be transformed for normal distribution.

In [35]:
[QQ_plot(dataset_wo_outliers, parameter) for parameter in ['age', 'bmi']]
Out[35]:
[None, None]

**Normal Transform**

https://medium.com/datarunner/veri-biliminde-normal-da%C4%9F%C4%B1lmayan-verilerin-d%C3%B6n%C3%BC%C5%9Ft%C3%BCr%C3%BClme-transformation-y%C3%B6ntemleri-logaritmik-ef316abb63f2

  • It is a statistical technique known to have a corrective effect on skewed data. The function determines the type of transforming the non-normal distribution to the normal distribution according to the lambda (λ) parameter. For Box-Cox, the lambda (λ) parameter has the range -5 <λ <5. The values of the lamp and the conversion type opposite are as follows.

    • lambda = -1 Reciprocal Tranformation
    • lambda = -0.5 Reciprocal Square root Transformation
    • lambda = 0.0 Logaritmic Transformation
    • lambda = 0.5 Square root Transformation
    • lambda = 1.0 No need tranform
  • Reciprocal Transformation: The value of X will be replaced by the inverse of X (1 / X). This change is not suitable for variables containing 0, as 1/X will go to infinity.
  • Square-Root Transformation: The value of X is replaced by the square root of X. The main advantage of the square root transformation is that it can be applied to zero values.
In [36]:
def norm_transform(method,df, parameter) :  
      
    # Orjinal değişken

    x_original = df[parameter].sort_values()
    
    if method == 'Box-Cox':
        
        x_sr, lam = stats.boxcox(x_original)
        
    elif method == 'Reciprocal Transformation':
        x_sr = 1 / x_original
        
    elif method == 'Square-Root':
        x_sr = x_original ** (1/2)
    
                                    
    df[parameter] = x_sr
    # original distribution
    mean, std = stats.norm.fit(x_original, loc=0)
    pdf_norm = stats.norm.pdf(x_original, mean, std)

    # Box-Cox distribution
    mean, std = stats.norm.fit(x_sr, loc=0)
    pdf_norm_sk = stats.norm.pdf(x_sr, mean, std)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))    
    # Original values and their distributions
    ax1.hist(x_original, bins= 20 , density=True
             ,color = "grey", ec= "skyblue")

    ax1.plot(x_original, pdf_norm
             ,color = "red", linewidth=4, linestyle=':')

    ax1.set_ylabel('Probability')
    ax1.set_title('Original Shape')

    # Box-Cox transformation
    ax2.hist(x_sr, bins= 20, density=True
             ,color = "green", ec="skyblue")

    ax2.plot(x_sr, pdf_norm_sk
             ,color = "red", linewidth=4, linestyle=':')

    print(str(parameter))
    ax2.set_title(str(method) + ' Transformation')
    plt.show()
    df[parameter] = x_sr
In [37]:
dataset_new_box_cox = dataset_wo_outliers.copy()
[norm_transform('Box-Cox', dataset_new_box_cox, parameter) for parameter in ['age']]
age
Out[37]:
[None]
In [38]:
dataset_new_box_cox
Out[38]:
age bmi children charges Sex_int smoker_int region_northwest region_southeast region_southwest
0 7.931435 27.900 0 16884.92400 0 1 0 0 1
1 7.931435 33.770 1 1725.55230 1 0 0 1 0
2 7.931435 33.000 3 4449.46200 1 0 0 1 0
3 7.931435 22.705 0 21984.47061 1 0 1 0 0
4 7.931435 28.880 0 3866.85520 1 0 1 0 0
... ... ... ... ... ... ... ... ... ...
1317 19.129377 30.970 3 10600.54830 1 0 1 0 0
1318 19.129377 31.920 0 2205.98080 0 0 0 0 0
1319 19.129377 36.850 0 1629.83350 0 0 0 1 0
1320 19.129377 25.800 0 2007.94500 0 0 0 0 1
1321 19.129377 29.070 0 29141.36030 0 1 1 0 0

1322 rows × 9 columns

In [39]:
[Shapiro_Wilk_Test(dataset_new_box_cox, parameter) for parameter in ['age']]
age test whether a data set is only normally distributed
W Statictic: 0.9440970420837402 
p value: 5.959806021185804e-22
H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.

Out[39]:
[None]
In [40]:
'''
The parameters are standardized for Kolmogorov-Smirnov Test .
'''
dataset_new_box_cox_stand = dataset_new_box_cox.copy()
[standardization(dataset_new_box_cox_stand, parameter) for parameter in ['age', 'bmi']]
dataset_new_box_cox_stand
Out[40]:
age bmi children charges Sex_int smoker_int region_northwest region_southeast region_southwest
0 -1.618489 -0.447437 0 16884.92400 0 1 0 0 1
1 -1.618489 0.524076 1 1725.55230 1 0 0 1 0
2 -1.618489 0.396637 3 4449.46200 1 0 0 1 0
3 -1.618489 -1.307234 0 21984.47061 1 0 1 0 0
4 -1.618489 -0.285242 0 3866.85520 1 0 1 0 0
... ... ... ... ... ... ... ... ... ...
1317 1.638929 0.060663 3 10600.54830 1 0 1 0 0
1318 1.638929 0.217892 0 2205.98080 0 0 0 0 0
1319 1.638929 1.033831 0 1629.83350 0 0 0 1 0
1320 1.638929 -0.794997 0 2007.94500 0 0 0 0 1
1321 1.638929 -0.253796 0 29141.36030 0 1 1 0 0

1322 rows × 9 columns

In [41]:
[Kolmogorov_Smirnov_Test(dataset_new_box_cox_stand, parameter) for parameter in ['age']]
age test whether a data set is only normally distributed.
ks Statistic: 0.07619713993449828 
p value: 4.022893110721751e-07
H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.

Out[41]:
[None]

Best linear unbiased prediction (BLUE)

https://statisticalhorizons.com/is-ols-blue-or-bue/

https://acikders.ankara.edu.tr/pluginfile.php/30758/mod_resource/content/0/6_Uygun%20Hipotez%20Testinin%20Se%C3%A7imi.pdf

In [42]:
dataset_new_box_cox_1= dataset_wo_outliers.copy()
[norm_transform('Box-Cox', dataset_new_box_cox_1, parameter) for parameter in ['charges']]
charges
Out[42]:
[None]
In [43]:
dataset_new_box_cox_stand_1 = dataset_new_box_cox_1.copy()
[standardization(dataset_new_box_cox_stand_1, parameter) for parameter in ['charges']]
[Kolmogorov_Smirnov_Test(dataset_new_box_cox_stand_1, parameter) for parameter in ['charges']]
charges test whether a data set is only normally distributed.
ks Statistic: 0.03920828435055035 
p value: 0.03342985390288211
H0 hypothesis is rejected. There is sufficient evidence to say that the data is not normally distributed.

Out[43]:
[None]

**Multiple Regression**

AIC means Akaike’s Information Criteria. --> present the danger that it would outfit

BIC means Bayesian Information Criteria. --> present the danger that it would underfit

http://www.differencebetween.net/miscellaneous/difference-between-aic-and-bic/#ixzz7p8lyEOdo

In [44]:
dataset_wo_outliers.exog = sm.add_constant(dataset_wo_outliers[independent_variables], prepend=False)

# Fit and summarize OLS model
mod = sm.OLS(dataset_wo_outliers[dependent_variable], dataset_wo_outliers.exog )

res = mod.fit()

print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                charges   R-squared:                       0.753
Model:                            OLS   Adj. R-squared:                  0.752
Method:                 Least Squares   F-statistic:                     501.3
Date:                Tue, 20 Jun 2023   Prob (F-statistic):               0.00
Time:                        20:03:33   Log-Likelihood:                -13337.
No. Observations:                1322   AIC:                         2.669e+04
Df Residuals:                    1313   BIC:                         2.674e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age                256.0187     11.532     22.201      0.000     233.396     278.642
bmi                327.4143     27.966     11.708      0.000     272.552     382.277
children           511.5555    133.402      3.835      0.000     249.852     773.259
Sex_int            -97.8954    322.872     -0.303      0.762    -731.298     535.507
smoker_int        2.322e+04    402.789     57.636      0.000    2.24e+04     2.4e+04
region_northwest  -453.6206    461.579     -0.983      0.326   -1359.133     451.892
region_southeast -1013.7175    463.753     -2.186      0.029   -1923.495    -103.940
region_southwest -1023.1013    463.766     -2.206      0.028   -1932.905    -113.298
const            -1.154e+04    964.105    -11.971      0.000   -1.34e+04   -9649.958
==============================================================================
Omnibus:                      261.911   Durbin-Watson:                   2.096
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              527.858
Skew:                           1.149   Prob(JB):                    2.38e-115
Kurtosis:                       5.075   Cond. No.                         313.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\msahin42\AppData\Local\Temp\ipykernel_10984\2032741484.py:1: UserWarning:

Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

**Formula**

$Charge$ = -11540 + 256.0187 Age + 327.4143 BMI + 511.5555 Children - 97.8954 Sex + 23220 smoker - 453.6206 Northwest - 1013.7175 Southeast - 1023.1013 Soutwest

In [45]:
plt.rcParams["figure.figsize"] = [12.50, 7.50]
plt.scatter (dataset_wo_outliers.age , dataset_wo_outliers['charges'] , color = "red")
plt.scatter( dataset_wo_outliers.age, res.predict(dataset_wo_outliers.exog), color = "blue")
plt.xlabel ("age", fontweight = 'bold', fontsize=13)
plt.ylabel  ( "Charges", fontweight = 'bold', fontsize=13)
plt.title ( "Age - Charges" , fontweight = 'bold', fontsize=16)

plt.show()

**Finding Beta Coefficient**

In [46]:
dataset_new_stand = dataset_wo_outliers.copy()
[standardization(dataset_new_stand, parameter) for parameter in ['age', 'bmi', 'children']]

dataset_new_stand.exog = sm.add_constant(dataset_new_stand[independent_variables], prepend=False)

# Fit and summarize OLS model
mod = sm.OLS(dataset_new_stand[dependent_variable], dataset_new_stand.exog )

result = mod.fit()

print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                charges   R-squared:                       0.753
Model:                            OLS   Adj. R-squared:                  0.752
Method:                 Least Squares   F-statistic:                     501.3
Date:                Tue, 20 Jun 2023   Prob (F-statistic):               0.00
Time:                        20:03:33   Log-Likelihood:                -13337.
No. Observations:                1322   AIC:                         2.669e+04
Df Residuals:                    1313   BIC:                         2.674e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age               3597.1296    162.026     22.201      0.000    3279.272    3914.987
bmi               1978.2767    168.973     11.708      0.000    1646.789    2309.764
children           617.7554    161.096      3.835      0.000     301.721     933.789
Sex_int            -97.8954    322.872     -0.303      0.762    -731.298     535.507
smoker_int        2.322e+04    402.789     57.636      0.000    2.24e+04     2.4e+04
region_northwest  -453.6206    461.579     -0.983      0.326   -1359.133     451.892
region_southeast -1013.7175    463.753     -2.186      0.029   -1923.495    -103.940
region_southwest -1023.1013    463.766     -2.206      0.028   -1932.905    -113.298
const             9065.7607    374.747     24.192      0.000    8330.592    9800.930
==============================================================================
Omnibus:                      261.911   Durbin-Watson:                   2.096
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              527.858
Skew:                           1.149   Prob(JB):                    2.38e-115
Kurtosis:                       5.075   Cond. No.                         5.63
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\msahin42\AppData\Local\Temp\ipykernel_10984\1824816278.py:4: UserWarning:

Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

**Conclusion and Recommendations**

Taking into account the age, number of children, place of residence, body mass index, gender, and smoking status of a new customer, insurance payments can be predicted by using the multiple regression model, which is explainable at a rate of 75.3%. The fact that an individual smokes increases the insurance payment, and it can be concluded that there is a positive relationship between age and insurance payment. Increasing body mass index also increases insurance payments. However, as mentioned in the last paragraph of the findings and evaluation section, in reality insurance payments are grouped in three different areas. The established model was somewhat inadequate in separating these three groups. During the analysis of the data, it was stated that little data were collected about smokers and there was an imbalance. For future studies, increasing the number of smokers and repeating the analyzes can make a positive contribution to the model results. In addition, with the increase in the number of data, the normality will increase and the sample will better represent the main population.